Part I - Ford GoBike Trip Analysis

by Abiodun Azeez

Introduction:

This dataset includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area during February 2019, with a few trips spilling into early March.

Preliminary Wrangling

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px

%matplotlib inline
In [2]:
# load in the dataset into a pandas dataframe
bikes = pd.read_csv('201902-fordgobike-tripdata.csv')
In [3]:
#viewing the first 5 rows of the dataset
bikes.head()
Out[3]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes
In [4]:
# checking for the last 5 ROWS 
bikes.tail()
Out[4]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
183407 480 2019-02-01 00:04:49.7240 2019-02-01 00:12:50.0340 27.0 Beale St at Harrison St 37.788059 -122.391865 324.0 Union Square (Powell St at Post St) 37.788300 -122.408531 4832 Subscriber 1996.0 Male No
183408 313 2019-02-01 00:05:34.7440 2019-02-01 00:10:48.5020 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 66.0 3rd St at Townsend St 37.778742 -122.392741 4960 Subscriber 1984.0 Male No
183409 141 2019-02-01 00:06:05.5490 2019-02-01 00:08:27.2200 278.0 The Alameda at Bush St 37.331932 -121.904888 277.0 Morrison Ave at Julian St 37.333658 -121.908586 3824 Subscriber 1990.0 Male Yes
183410 139 2019-02-01 00:05:34.3600 2019-02-01 00:07:54.2870 220.0 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 216.0 San Pablo Ave at 27th St 37.817827 -122.275698 5095 Subscriber 1988.0 Male No
183411 271 2019-02-01 00:00:20.6360 2019-02-01 00:04:52.0580 24.0 Spear St at Folsom St 37.789677 -122.390428 37.0 2nd St at Folsom St 37.785000 -122.395936 1057 Subscriber 1989.0 Male No
In [5]:
#structure of the dataset
bikes.shape
Out[5]:
(183412, 16)

This dataset is made up of 183,412 rows and 16 columns.

In [6]:
#checking for columns with missing data
bikes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object 
 15  bike_share_for_all_trip  183412 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB

There are 183,412 entries in the dataset, with missing data in several columns: start_station_id, start_station_name, end_station_id, end_station_name, member_birth_year and member_gender.

In [7]:
#DataTypes present in the dataset
bikes.dtypes
Out[7]:
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object

The bike dataset's columns span three dtypes: 7 float64, 7 object and 2 int64.

In [8]:
# descriptive statistics for numeric variables
bikes.describe()
Out[8]:
duration_sec start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year
count 183412.000000 183215.000000 183412.000000 183412.000000 183215.000000 183412.000000 183412.000000 183412.000000 175147.000000
mean 726.078435 138.590427 37.771223 -122.352664 136.249123 37.771427 -122.352250 4472.906375 1984.806437
std 1794.389780 111.778864 0.099581 0.117097 111.515131 0.099490 0.116673 1664.383394 10.116689
min 61.000000 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000
25% 325.000000 47.000000 37.770083 -122.412408 44.000000 37.770407 -122.411726 3777.000000 1980.000000
50% 514.000000 104.000000 37.780760 -122.398285 100.000000 37.781010 -122.398279 4958.000000 1987.000000
75% 796.000000 239.000000 37.797280 -122.286533 235.000000 37.797320 -122.288045 5502.000000 1992.000000
max 85444.000000 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000
In [9]:
#checking for the total sum of missing data present in the dataset columns 
missing_data = bikes.isnull().sum()
missing_data
Out[9]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64

There is missing data across 6 different columns in the dataset:

  • 8,265 values missing in both member_birth_year and member_gender.

  • 197 values missing in each of 4 columns: start_station_id, start_station_name, end_station_id and end_station_name.

In [10]:
bikes.duplicated().sum()
Out[10]:
0

There are no duplicate rows in the dataset.

In [11]:
# percentage of missing data present
total_cell = np.prod(bikes.shape)
total_missing_data = missing_data.sum()

(total_missing_data/total_cell) * 100
Out[11]:
0.590133142869605

The percentage of missing data is below 1%, so dropping the rows with missing values is a reasonable choice.
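Before dropping, it can also help to look at the missing percentage per column rather than overall; a minimal sketch, using a small hypothetical frame standing in for `bikes`:

```python
import numpy as np
import pandas as pd

# hypothetical mini-frame standing in for the `bikes` dataframe
bikes_demo = pd.DataFrame({
    'duration_sec': [325, 514, 796, 1585],
    'start_station_id': [21.0, np.nan, 86.0, 375.0],
    'member_birth_year': [1984.0, np.nan, 1972.0, np.nan],
})

# mean of the boolean null mask gives the fraction missing per column
missing_pct = bikes_demo.isnull().mean() * 100
print(missing_pct)
```

On the real dataset this would show member_birth_year/member_gender near 4.5% missing and the station columns near 0.1%, all comfortably droppable.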

In [12]:
bikes1 = bikes.copy()

Data Wrangling:

Define:

Change the dtypes of the following columns:
    *start_station_id from float to object.
    *end_station_id from float to object.
    *member_birth_year from float to object.

Code:

In [13]:
bikes1['start_station_id'] = bikes1['start_station_id'].astype(object)
bikes1['end_station_id'] = bikes1['end_station_id'].astype(object)
bikes1['member_birth_year'] = bikes1['member_birth_year'].astype(object)

Test:

In [14]:
print(bikes1['start_station_id'].dtypes)
print(bikes1['end_station_id'].dtypes)
print(bikes1['member_birth_year'].dtypes)
object
object
object

Define:

Convert the dtype of these columns to ordered categorical:
    *user_type from object to categorical.
    *member_gender from object to categorical.
    *bike_share_for_all_trip from object to categorical.

Code:

In [15]:
ordinal_var_dict = {'user_type': ['Customer','Subscriber'],
                    'member_gender': ['Male', 'Female','Other'],
                    'bike_share_for_all_trip': ['No', 'Yes']}

for i in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[i])
    bikes1[i] = bikes1[i].astype(ordered_var)

Test:

In [16]:
print(bikes1['user_type'].dtypes)
print(bikes1['member_gender'].dtypes)
print(bikes1['bike_share_for_all_trip'].dtypes)
category
category
category
In [17]:
bikes1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype   
---  ------                   --------------   -----   
 0   duration_sec             183412 non-null  int64   
 1   start_time               183412 non-null  object  
 2   end_time                 183412 non-null  object  
 3   start_station_id         183215 non-null  object  
 4   start_station_name       183215 non-null  object  
 5   start_station_latitude   183412 non-null  float64 
 6   start_station_longitude  183412 non-null  float64 
 7   end_station_id           183215 non-null  object  
 8   end_station_name         183215 non-null  object  
 9   end_station_latitude     183412 non-null  float64 
 10  end_station_longitude    183412 non-null  float64 
 11  bike_id                  183412 non-null  int64   
 12  user_type                183412 non-null  category
 13  member_birth_year        175147 non-null  object  
 14  member_gender            175147 non-null  category
 15  bike_share_for_all_trip  183412 non-null  category
dtypes: category(3), float64(4), int64(2), object(7)
memory usage: 18.7+ MB

Define:

Extracting Date from the start_time and end_time column

Code:

In [18]:
bikes1['start_date'] = pd.to_datetime(bikes1['start_time']).dt.date
bikes1['end_date'] = pd.to_datetime(bikes1['end_time']).dt.date

# conversion of start_date/end_date object to datetime dtype
bikes1['start_date'] = pd.to_datetime(bikes1['start_date'])
bikes1['end_date'] = pd.to_datetime(bikes1['end_date'])

Test:

In [19]:
bikes1.head()
Out[19]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_date end_date
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984 Male No 2019-02-28 2019-03-01
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23 The Embarcadero at Steuart St 37.791464 -122.391034 81 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No 2019-02-28 2019-03-01
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86 Market St at Dolores St 37.769305 -122.426826 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972 Male No 2019-02-28 2019-03-01
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375 Grove St at Masonic Ave 37.774836 -122.446546 70 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989 Other No 2019-02-28 2019-03-01
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7 Frank H Ogawa Plaza 37.804562 -122.271738 222 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974 Male Yes 2019-02-28 2019-03-01
In [20]:
bikes1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             183412 non-null  int64         
 1   start_time               183412 non-null  object        
 2   end_time                 183412 non-null  object        
 3   start_station_id         183215 non-null  object        
 4   start_station_name       183215 non-null  object        
 5   start_station_latitude   183412 non-null  float64       
 6   start_station_longitude  183412 non-null  float64       
 7   end_station_id           183215 non-null  object        
 8   end_station_name         183215 non-null  object        
 9   end_station_latitude     183412 non-null  float64       
 10  end_station_longitude    183412 non-null  float64       
 11  bike_id                  183412 non-null  int64         
 12  user_type                183412 non-null  category      
 13  member_birth_year        175147 non-null  object        
 14  member_gender            175147 non-null  category      
 15  bike_share_for_all_trip  183412 non-null  category      
 16  start_date               183412 non-null  datetime64[ns]
 17  end_date                 183412 non-null  datetime64[ns]
dtypes: category(3), datetime64[ns](2), float64(4), int64(2), object(7)
memory usage: 21.5+ MB

Define:

Dropping every missing data present in the dataset.

Code:

In [21]:
# remove all the rows that contain a missing value
bikes1 = bikes1.dropna()
bikes1
Out[21]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_date end_date
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984 Male No 2019-02-28 2019-03-01
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86 Market St at Dolores St 37.769305 -122.426826 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972 Male No 2019-02-28 2019-03-01
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375 Grove St at Masonic Ave 37.774836 -122.446546 70 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989 Other No 2019-02-28 2019-03-01
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7 Frank H Ogawa Plaza 37.804562 -122.271738 222 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974 Male Yes 2019-02-28 2019-03-01
5 1793 2019-02-28 23:49:58.6320 2019-03-01 00:19:51.7600 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 323 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959 Male No 2019-02-28 2019-03-01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183407 480 2019-02-01 00:04:49.7240 2019-02-01 00:12:50.0340 27 Beale St at Harrison St 37.788059 -122.391865 324 Union Square (Powell St at Post St) 37.788300 -122.408531 4832 Subscriber 1996 Male No 2019-02-01 2019-02-01
183408 313 2019-02-01 00:05:34.7440 2019-02-01 00:10:48.5020 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 66 3rd St at Townsend St 37.778742 -122.392741 4960 Subscriber 1984 Male No 2019-02-01 2019-02-01
183409 141 2019-02-01 00:06:05.5490 2019-02-01 00:08:27.2200 278 The Alameda at Bush St 37.331932 -121.904888 277 Morrison Ave at Julian St 37.333658 -121.908586 3824 Subscriber 1990 Male Yes 2019-02-01 2019-02-01
183410 139 2019-02-01 00:05:34.3600 2019-02-01 00:07:54.2870 220 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 216 San Pablo Ave at 27th St 37.817827 -122.275698 5095 Subscriber 1988 Male No 2019-02-01 2019-02-01
183411 271 2019-02-01 00:00:20.6360 2019-02-01 00:04:52.0580 24 Spear St at Folsom St 37.789677 -122.390428 37 2nd St at Folsom St 37.785000 -122.395936 1057 Subscriber 1989 Male No 2019-02-01 2019-02-01

174952 rows × 18 columns

Test:

In [22]:
bikes1.shape
Out[22]:
(174952, 18)
In [23]:
bikes1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  object        
 2   end_time                 174952 non-null  object        
 3   start_station_id         174952 non-null  object        
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  object        
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  int64         
 12  user_type                174952 non-null  category      
 13  member_birth_year        174952 non-null  object        
 14  member_gender            174952 non-null  category      
 15  bike_share_for_all_trip  174952 non-null  category      
 16  start_date               174952 non-null  datetime64[ns]
 17  end_date                 174952 non-null  datetime64[ns]
dtypes: category(3), datetime64[ns](2), float64(4), int64(2), object(7)
memory usage: 21.9+ MB

Define:

Convert member_birth_year to member_age

Code:

In [28]:
# use 2019 as the reference year, since that is when the trips in the dataset occurred
now = 2019
# create member_age from member_birth_year
bikes1['member_age'] = bikes1['member_birth_year'].apply(lambda x: now - x)
# convert to integer
bikes1['member_age'] = bikes1['member_age'].astype('int')

Test:

In [29]:
bikes1.head()
Out[29]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_date end_date member_age
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984 Male No 2019-02-28 2019-03-01 35
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86 Market St at Dolores St 37.769305 -122.426826 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972 Male No 2019-02-28 2019-03-01 47
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375 Grove St at Masonic Ave 37.774836 -122.446546 70 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989 Other No 2019-02-28 2019-03-01 30
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7 Frank H Ogawa Plaza 37.804562 -122.271738 222 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974 Male Yes 2019-02-28 2019-03-01 45
5 1793 2019-02-28 23:49:58.6320 2019-03-01 00:19:51.7600 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 323 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959 Male No 2019-02-28 2019-03-01 60
In [30]:
bikes1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  object        
 2   end_time                 174952 non-null  object        
 3   start_station_id         174952 non-null  object        
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  object        
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  int64         
 12  user_type                174952 non-null  category      
 13  member_birth_year        174952 non-null  object        
 14  member_gender            174952 non-null  category      
 15  bike_share_for_all_trip  174952 non-null  category      
 16  start_date               174952 non-null  datetime64[ns]
 17  end_date                 174952 non-null  datetime64[ns]
 18  member_age               174952 non-null  int32         
dtypes: category(3), datetime64[ns](2), float64(4), int32(1), int64(2), object(7)
memory usage: 22.5+ MB
In [31]:
bikes1.describe()
Out[31]:
duration_sec start_station_latitude start_station_longitude end_station_latitude end_station_longitude bike_id member_age
count 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000
mean 704.002744 37.771220 -122.351760 37.771414 -122.351335 4482.587555 34.196865
std 1642.204905 0.100391 0.117732 0.100295 0.117294 1659.195937 10.118731
min 61.000000 37.317298 -122.453704 37.317298 -122.453704 11.000000 18.000000
25% 323.000000 37.770407 -122.411901 37.770407 -122.411647 3799.000000 27.000000
50% 510.000000 37.780760 -122.398279 37.781010 -122.397437 4960.000000 32.000000
75% 789.000000 37.797320 -122.283093 37.797673 -122.286533 5505.000000 39.000000
max 84548.000000 37.880222 -121.874119 37.880222 -121.874119 6645.000000 141.000000
In [ ]:
# saving the cleaned dataset to csv
bikes1.to_csv('cleaned_bike_dataset.csv', index = False)

What is the structure of your dataset?

The original bikes dataset has 183,412 rows and 16 columns: duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender and bike_share_for_all_trip. Most of the variables are numeric, except for start_station_name and end_station_name, which give the addresses of the start and end stations, and the categorical variables:

    * user_type:['Customer','Subscriber']
    * member_gender:['Male', 'Female','Other']
    * bike_share_for_all_trip: ['No', 'Yes']

After cleaning, new features were added: member_age, start_date and end_date.

What is/are the main feature(s) of interest in your dataset?

My main features of interest are the trip duration together with its start/end time, the locations of the most frequent start and end stations, and the member characteristics that can affect trip duration, such as age, gender, user_type and bike_share_for_all_trip.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Primarily trip duration, the start/end station names (locations of the most frequent stations), and the rider attributes age, gender and user_type, along with bike_share_for_all_trip.

Univariate Exploration

Top 10 start dates by number of bike trips in SF:

In [33]:
# top ten start_date value_counts
date_top10 = bikes1.start_date.value_counts()[:10]
date_top10
Out[33]:
2019-02-28    9448
2019-02-20    9246
2019-02-21    9120
2019-02-19    9096
2019-02-07    8798
2019-02-22    8765
2019-02-06    8655
2019-02-11    8315
2019-02-12    8155
2019-02-05    8136
Name: start_date, dtype: int64
In [34]:
#bar plot showing the top 10 start dates with the most bike trips in SF
base_color = sb.color_palette()[0]
date_top10 = bikes1.start_date.value_counts()[:10].index
sb.countplot(data = bikes1, x = 'start_date', color = base_color, order = date_top10)
plt.xticks(rotation = 90)
plt.xlabel('Trip Start Date')
plt.ylabel('Count of Bike Trips')
plt.title('Top 10 start dates with the most bike trips in SF');

The bar plot above shows the value counts for the top 10 dates, all of which fall in February. The busiest day was 28 February 2019 with 9,448 trips (perhaps because it was the last day of the month), while 5 February had the lowest count among the top 10, with 8,136 trips.
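One way to probe the "last day of the month" hunch is to check which weekdays the top dates fall on; a quick sketch using a few of the dates from the value counts above:

```python
import pandas as pd

# a few of the top start dates listed above
top_dates = pd.to_datetime(pd.Series(['2019-02-28', '2019-02-20', '2019-02-21',
                                      '2019-02-19', '2019-02-07']))
# map each date to its day-of-week name
print(top_dates.dt.day_name().tolist())
# → ['Thursday', 'Wednesday', 'Thursday', 'Tuesday', 'Thursday']
```

All of these fall on weekdays, which hints that commute traffic, not just month-end effects, drives the busiest days.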

Top 10 start stations by number of bike trips in SF:

In [35]:
# top ten start stations with the most trips
start_station_top10 = bikes1.start_station_name.value_counts()[:10]
start_station_top10
Out[35]:
Market St at 10th St                                         3649
San Francisco Caltrain Station 2  (Townsend St at 4th St)    3408
Berry St at 4th St                                           2952
Montgomery St BART Station (Market St at 2nd St)             2711
Powell St BART Station (Market St at 4th St)                 2620
San Francisco Caltrain (Townsend St at 4th St)               2577
San Francisco Ferry Building (Harry Bridges Plaza)           2541
Howard St at Beale St                                        2216
Steuart St at Market St                                      2191
Powell St BART Station (Market St at 5th St)                 2144
Name: start_station_name, dtype: int64
In [36]:
#bar plot showing the ten most common start stations for bike trips
start_station_top10 = bikes1.start_station_name.value_counts()[:10].index
base_color = sb.color_palette()[0]
sb.countplot(data = bikes1, y = 'start_station_name', color = base_color, order = start_station_top10)
plt.xticks(rotation = 90)
plt.xlabel('Count of shared Bikes')
plt.ylabel('Start Station Name')
plt.title('Top 10 Stations with the most shared Bikes in SF');

The bar plot above shows the top 10 start station names (locations) for bike trips in San Francisco, with "Market St at 10th St" having the highest count at 3,649 trips and "Powell St BART Station (Market St at 5th St)" the lowest of the top 10 at 2,144.

Proportion of bike trips by gender:

In [37]:
#value count for member_gender
gender_counts = bikes1['member_gender'].value_counts()

#arranging the gender counts in descending order with .index
gender_order = gender_counts.index
In [38]:
#calculating the max_proportion in gender_counts
n_bikes = bikes1.shape[0]
max_gender_count = gender_counts.iloc[0]
max_prop = max_gender_count / n_bikes
print(max_prop)
0.7459188806072523
In [39]:
# Create an array of evenly spaced proportioned values
tick_props = np.arange(0,max_prop+0.1,0.1)
tick_props
Out[39]:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
In [40]:
#Create a list of String values that can be used as tick labels.
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
tick_names
Out[40]:
['0.00', '0.10', '0.20', '0.30', '0.40', '0.50', '0.60', '0.70', '0.80']
In [41]:
#Plot the bar chart, with new x-tick labels
sb.countplot(data=bikes1, y='member_gender', color=base_color, order=gender_order);
# Change the tick locations and labels
plt.xticks(tick_props * n_bikes, tick_names)
plt.xlabel('proportion');
In [42]:
#Print the text (proportion) on the bars of a horizontal plot.
base_color = sb.color_palette()[0]
sb.countplot(data=bikes1, y='member_gender', color=base_color, order=gender_order)

# annotate each bar with its percentage;
# gender_counts holds the frequency of each `member_gender` value in decreasing order
for i in range(gender_counts.shape[0]):
    count = gender_counts.iloc[i]
    # convert the count into a percentage string
    pct_string = '{:0.1f}%'.format(100*count/n_bikes)
    plt.text(count+1, i, pct_string, va='center')

plt.title('Proportion of Bike Trips by Gender')
plt.xlabel('Count of Bike Trips')

The horizontal bar plot above shows the proportion of bike trips by gender: male riders account for 74.6%, well above the other groups, with female at 23.3% and other at 2.1%.
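The same proportions can be obtained in a single step with `value_counts(normalize=True)`, without computing `n_bikes` by hand; a sketch on a hypothetical stand-in for `bikes1['member_gender']`:

```python
import pandas as pd

# hypothetical gender sample standing in for bikes1['member_gender']
gender = pd.Series(['Male'] * 6 + ['Female'] * 3 + ['Other'])

# normalize=True divides each count by the total, giving proportions directly
props = gender.value_counts(normalize=True)
print(props)
```

On the real column this returns the 0.746 / 0.233 / 0.021 split shown on the bars, already sorted in descending order.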

Proportion of trip duration by user_type:

In [43]:
# proportion distribution of User_type using plotly
fig = px.pie(bikes1, values='duration_sec',names=bikes1['user_type'],
             title='Proportions of Bikers Duration')
fig.show()

This visual uses a new Python library, plotly, to show the proportion of total trip duration (sec) for each user_type: Subscriber at 82.4% and Customer at 17.6%.

Distribution of Age in SF:

In [44]:
def histogram_solution_1():
    plt.figure(figsize=(8,6))
    bins = np.arange(0, bikes1['member_age'].max()+2, 2)
    plt.hist(bikes1['member_age'], bins = bins)
    plt.xlabel('Age (Year)')
    plt.ylabel('Count')
    plt.title('Age distribution of Ford GoBike data in SF');
histogram_solution_1()

The distribution of age has a long right-skewed tail.

In [45]:
# there's a long tail in the distribution, so let's put it on a log scale instead
log_binsize = 0.025
bins = 10 ** np.arange(1.2, np.log10(bikes1['member_age'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 6])
plt.hist(data = bikes1, x = 'member_age', bins = bins)
plt.xscale('log')
plt.xticks([10,20,30,35,40,50,70,90,100], [10,20,30,35,40,50,70,90,100])
plt.xlabel('Age (Year)')
plt.ylabel('Count')
plt.title('Log-scale Distribution of Age in SF');

The histogram above shows the age distribution under a log transformation: most trips were taken by riders between roughly 25 and 42 years of age, largely young adults, with far fewer trips by older riders.

Distribution of Bike Duration(sec) in SF:

In [46]:
def histogram_solution_2():
    plt.figure(figsize=(10,6))
    bins = np.arange(60,10000, 50)
    plt.hist(bikes1['duration_sec'], bins = bins)
    plt.xlabel('Bike Duration (Sec)')
    plt.ylabel('Count')
    plt.title('Distribution of Bike Duration(sec) in SF');
histogram_solution_2()

The distribution of bike duration (sec) has a long right-skewed tail.

In [47]:
# there's a long tail in the distribution, so let's put it on a log scale instead
log_binsize = 0.025
bins = 10 ** np.arange(1.4, np.log10(bikes1['duration_sec'].max())+log_binsize, log_binsize)
plt.figure(figsize=[10, 6])
plt.hist(data = bikes1, x = 'duration_sec', bins = bins)
plt.xscale('log')
plt.xticks([50, 60, 70, 100, 250, 450,600,1000,8000],
           [50, 60, 70, 100, 250, 450,600,1000,8000]) 
plt.xlabel('Bike Duration (Sec)')
plt.title('Log-scale for Duration_sec')
plt.show()

Using a log scale with defined ticks shows that the bulk of trips last a few hundred seconds, with a long tail of longer trips stretching out toward 10k seconds.
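To make the second counts on the log axis easier to interpret, the same durations can be re-expressed in minutes; a quick sketch with hypothetical values standing in for `bikes1['duration_sec']`:

```python
import pandas as pd

# hypothetical trip durations in seconds (roughly the quartiles plus the tail)
dur_sec = pd.Series([325, 514, 796, 2500, 8000])

# dividing by 60 converts seconds to minutes
dur_min = (dur_sec / 60).round(1)
print(dur_min.tolist())
```

This puts the typical trip at roughly 5-13 minutes, with the long tail reaching past two hours.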

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The distribution of age was right-skewed and very jam-packed in terms of data points, so a log transformation was applied to show the distribution more clearly; it revealed that most trips were taken by riders between 25 and 42 years of age.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. Data wrangling was carried out first because of messy/untidy data: missing values were dropped, duplicates were checked, and, most importantly, dtypes were converted.

The duration distribution was unusual in that it spanned several orders of magnitude, which made the raw data points hard to read; after a log transformation with well-defined ticks, it became clear that most trips last a few hundred seconds, with a long tail out to about 10k seconds.

Bivariate Exploration

Correlation between Member_age and Duration (2 Quantitative variables):

In [48]:
numeric_vars = ['member_age','duration_sec']
categoric_vars = ['user_type', 'member_gender', 'bike_share_for_all_trip']
In [49]:
# correlation plot using Heatmap
plt.figure(figsize = [8, 5])
sb.heatmap(bikes1[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)  # center the diverging palette at r = 0
plt.title('Correlation between member_age and duration')
plt.show()

The heatmap's color range shows that there is not much correlation between member_age and duration_sec.
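For intuition, pandas' .corr() computes the pairwise Pearson r, which np.corrcoef reproduces. A toy check on hypothetical values deliberately chosen so the correlation is exactly zero:

```python
import numpy as np

# Hypothetical ages and trip durations, constructed so that the
# deviations from the means cancel and Pearson r is exactly 0
age      = np.array([25, 35, 45, 55])
duration = np.array([500, 700, 700, 500])

r = np.corrcoef(age, duration)[0, 1]
print(round(r, 3))  # 0.0
```

A value this close to zero is what the pale heatmap cell for member_age vs duration_sec corresponds to.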

In [50]:
def scatterplot_solution_1():
  sb.regplot(data = bikes1, x = 'duration_sec', y = 'member_age');
  # plt.plot([10,60], [10,60]) # diagonal line from (10,10) to (60,60)
  plt.xlabel('duration.(Sec)')
  plt.ylabel('member_age. (Years)')
  plt.title('Correlation between duration and member_age')  

scatterplot_solution_1()

This scatter plot shows a slight positive relationship between duration_sec and member_age (years). Although the data points are overplotted, the visual suggests that riders between 25 and 54 years of age tend to take longer-duration trips.

In [51]:
# plot matrix: sampled to avoid overplotting and give a clearer view of the numeric data.
print("bikes1.shape=",bikes1.shape)
bikes1_samp = bikes1.sample(n=500, replace = False)
print("bikes1_samp.shape=",bikes1_samp.shape)

g = sb.PairGrid(data = bikes1_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter);
bikes1.shape= (174952, 19)
bikes1_samp.shape= (500, 19)
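The sampling step above thins 174,952 rows down to 500 so the pair grid stays readable. One refinement worth noting (an addition here, not in the original cell): passing random_state to DataFrame.sample makes the subsample reproducible across runs. A minimal sketch on a synthetic frame:

```python
import pandas as pd

# Synthetic stand-in for bikes1; sampling without replacement draws
# each row at most once, and random_state fixes the draw for reproducibility
df = pd.DataFrame({'duration_sec': range(1000), 'member_age': range(1000)})
samp = df.sample(n=500, replace=False, random_state=42)

assert samp.shape == (500, 2)
assert not samp.index.duplicated().any()  # no row drawn twice
```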
In [52]:
# plot matrix of numeric features against categorical features.
bikes1_samp = bikes1.sample(n=2000, replace = False)


def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x=x, y=y, color=default_color)

g = sb.PairGrid(data = bikes1_samp, y_vars = ['member_age','duration_sec'], x_vars = categoric_vars,
                height = 3, aspect = 1.5)
g.map(boxgrid)
plt.show()

Which Riders Were Bikes Most Shared To?:

In [53]:
sb.countplot(data = bikes1, x = 'member_gender', hue = 'user_type')
plt.ylabel('counts of Shared bike')
plt.title('Relationship of Gender Associated with Riders');

The clustered bar plot depicts that the majority of bikes shared in San Francisco went to male riders, most of whom were of the Subscriber type, with very few Customers.
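The counts behind a clustered countplot like this can be reproduced with a cross-tabulation. A sketch on a handful of hypothetical trips (the real notebook would pass bikes1's columns):

```python
import pandas as pd

# Hypothetical trips; pd.crosstab tallies gender x user_type,
# which is exactly what sb.countplot(..., hue='user_type') draws as bars
trips = pd.DataFrame({
    'member_gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Other'],
    'user_type':     ['Subscriber', 'Subscriber', 'Subscriber',
                      'Customer', 'Customer', 'Subscriber'],
})
counts = pd.crosstab(trips['member_gender'], trips['user_type'])
print(counts)
```

Reading the table row-wise gives the same Subscriber-vs-Customer split per gender that the bar heights encode.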

Visualization of correlations within variables using heatmap:

In [54]:
sb.heatmap(bikes1.corr(), annot=True);

This heatmap shows a slight correlation of 0.075 between member_age and start_station_latitude/end_station_latitude, which suggests that many riders travelled along the north-south axis of San Francisco.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The relationships observed in this part of the investigation were between pairs of variables:

First, I looked closely at the relationship between member_age and duration_sec using a heatmap, which showed a very low correlation of 0.006; a scatter plot of the same two variables showed a slightly positive correlation, with most riders falling between 25 and 54 years of age.

Second, the clustered bar chart depicts that male riders were the most common, and that the majority of them were Subscribers.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. Riders of the "Other" gender barely participated in bike trips, regardless of whether their user_type was Customer or Subscriber.

Multivariate Exploration

Which Ages and Genders Travel the Longest Trips?

In [55]:
# scatter plot showing a better description of relationship between duration & age by Gender
fig = px.scatter(bikes1, x="duration_sec", y="member_age", size="duration_sec" ,
                 color=bikes1['member_gender'],
                 title='Correlation between duration and member_age by Gender',log_y=True, size_max=20)
fig.show()

Plotly, a Python library, was used here because I needed to depict every data point between duration and member_age, encoding duration_sec as the marker size and gender as a third variable via color. The scatter plot shows that riders who travelled more than 25k seconds are between 25 and 42 years of age (mostly male), while trips under 20k seconds include riders recorded as 50 to 140 years old (the extreme ages likely reflect birth-year entry errors), though a few younger riders (18-30 years) also travelled under 20k seconds.

Scatter Plot Encoding via Shape: Which Riders Travel the Longest Trips, and What Are Their Ages?

In [56]:
sign_markers = [['Customer', 'o'],
               ['Subscriber', 's']]

for sign, marker in sign_markers:
    bikes_sign = bikes1[bikes1['user_type'] == sign]
    plt.scatter(data = bikes_sign, x = 'duration_sec', y = 'member_age', marker = marker)
plt.legend(['Customer','Subscriber'], title = 'user_type')
plt.xlabel('Duration(sec)')
plt.ylabel('Age(years)')
plt.title('Age and Duration by user_type');

The scatter plot above shows the relationship between member_age and trip duration in seconds, using user_type as the point of comparison. The plot shows heavy congestion of data points for both Subscribers and Customers within the 0-20k second range, mostly among riders aged 28-45 years.

In [57]:
# Faceting in two direction to avoid overplotting
g = sb.FacetGrid(data = bikes1,col ='user_type')
g.map(plt.scatter,'duration_sec', 'member_age');

Splitting user_type into separate subplots using FacetGrid.

In [58]:
# scatter plot showing a better description of relationship between duration & age by user_type
fig = px.scatter(bikes1, x="duration_sec", y="member_age", size="duration_sec" ,
                 color=bikes1['user_type'],
                 title='Correlation between duration and member_age associated with Riders',log_y=True, size_max=20)
fig.show()

The scatter plot above, made with Plotly, shows that the majority of riders who travelled the longest trips (in seconds) are of the Subscriber type and between 28 and 57 years of age.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relationships observed here are the correlations between member_age and duration, first by gender and then by user_type. The features of interest were most strengthened between member_age and duration by user_type, where the relationship was clearest and contained the most data points.

Were there any interesting or surprising interactions between features?

Yes; the degree of overlap (collision) of data points between features was striking.

Conclusions

From the exploratory data analysis above, several findings and observations were made across one, two, or more variables in the bike trip dataset. First, I identified the most common start dates for shared bikes in San Francisco and the most frequent start stations. Histograms were then used to examine the distributions of Age and Duration (sec), and pie charts showed the proportions of user types and genders: the riders were predominantly male, and the majority of those males were of the Subscriber type. Finally, several relationships were explored with scatter plots in the bivariate and multivariate sections, such as member_age versus duration_sec, using user_type and member_gender as points of comparison.